Correlating electronic health record concepts
with healthcare process events 



George Hripcsak, David J Albers 

ABSTRACT 

Objective To study the relation between electronic 
health record (EHR) variables and healthcare process 
events. 

Materials and methods Lagged linear correlation 
was calculated between five healthcare process events 
and 84 EHR variables (24 clinical laboratory values and 
60 clinical concepts extracted from clinical notes) in a 
24-year database. The EHR variables were clustered for 
each healthcare process event and interpreted.

Results Laboratory tests tended to cluster together and
note concepts tended to cluster together. Within each of 
those two classes, the variables clustered into clinically 
sensible groupings. The exact groupings varied from 
healthcare process event to event, with the largest 
differences occurring between inpatient events and 
outpatient events. 

Discussion Unlike previously reported pairwise 
associations between variables, which highlighted 
correlations across the laboratory-clinical note divide, 
incorporating healthcare process events appeared to be 
sensitive to the manner in which the variables were 
collected. 

Conclusion We believe that it may be possible to 
exploit this sensitivity to help knowledge engineers select 
variables and correct for biases. 



INTRODUCTION 

The national push for electronic health records
(EHR) should eventually lead to the documentation
of approximately one billion patient visits per
year in the USA and should represent a boon to
observational research. One of the major challenges
to reusing EHR data comes from the inaccuracy,
incompleteness, complexity, and resulting bias
inherent in the recording of the healthcare
process. Therefore, EHR data cannot be treated
simply as research data with noise and missing
values; instead, the EHR carries systematic biases
that must be addressed before the data can reach
their potential.

The state of the art in generating phenotypes
from EHR data is to use a heuristic, iterative
approach. The Electronic Medical Records and
Genomics (eMERGE) Network and the
Observational Medical Outcomes Partnership
(OMOP)4 provide two large-scale examples. For
example, clinical experts may be enlisted to identify
a subset of subjects relevant to a phenotype. A
knowledge engineer then generates a heuristic rule
that maps EHR data (such as physician notes,
billing codes, and laboratory tests) to variables in
the study. The rule is tested on the subset, and it is
modified iteratively until sensitivity and specificity
reach some threshold. The rule is eventually
applied to the entire cohort. Unfortunately, these
methods are themselves time consuming; there is
much information that is not used, and knowledge
engineers and clinical experts bring their own
biases. The process can take months.
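The test step in this iterative loop can be sketched as follows. This is a minimal illustration, not the eMERGE or OMOP procedure: the function name `sensitivity_specificity` and the eight labeled subjects are hypothetical, chosen only to show how a candidate rule would be scored against expert review before the next revision.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # rule and expert both positive
    tn = sum(1 for p, a in pairs if not p and not a)  # rule and expert both negative
    fp = sum(1 for p, a in pairs if p and not a)      # rule positive, expert negative
    fn = sum(1 for p, a in pairs if not p and a)      # rule negative, expert positive
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical expert labels and rule outputs for eight reviewed subjects.
expert = [True, True, True, False, False, False, False, True]
rule = [True, True, False, False, False, True, False, True]
sens, spec = sensitivity_specificity(rule, expert)
```

In practice the knowledge engineer would revise the rule and repeat the scoring until both values pass whatever threshold the study has set.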

We believe that the path forward involves systemizing
the phenotyping process with the hope of
future automation or partial automation. We hope
to understand better how the healthcare process
affects the recording of clinical information in the
EHR so that we can improve and perhaps speed
the generation of phenotypes. This study is a first
step in that process. We employ our existing techniques
to measure lagged linear correlation to study
the association between a number of EHR variables
and five common healthcare process events:
inpatient admission, inpatient discharge, outpatient
visit, emergency department visit, and ambulatory
surgery. We then cluster variables according to
those associations, looking for groups of variables
that behave similarly, hypothesizing that the groups
will represent not only clinical and physiological
properties but also characteristics related to the way
the information is gathered and recorded.



METHODS 

We used the Columbia University Medical Center
clinical data warehouse, which contains 24 years
of data on 3.6 million patients. From this warehouse,
we selected 24 laboratory tests and 60 clinical
concepts derived from residents' signout notes
to represent EHR variables (see supplement, available
online only). Signout notes are used to transfer
care to and from overnight shifts. There were
2 301 730 notes on 213 464 patients. The laboratory
tests were all continuous and the concepts
were represented as 1 if they were present in a note
and 0 if they were absent from a note. We used
simple regular expressions of stemmed concepts to
detect the presence of the concepts in the notes.
We had previously found that in this particular corpus,
resident signout notes, performance in finding
correlations was excellent despite ignoring negation
and other modifiers. Based on a manual review of
notes, we found that residents simply do not use
negation frequently in the context of signing their
service over; instead they state very concisely only
what is present.
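A minimal sketch of this kind of stem-based detection is shown below. The stems, the concept names, and the sample note are hypothetical (the study's actual concept list is in its online supplement), and negation is deliberately ignored, as described above.

```python
import re

# Hypothetical stems for illustration only; not the study's actual patterns.
CONCEPT_STEMS = {
    "nausea": r"nause",
    "vomiting": r"vomit",
    "hyperkalemia": r"hyperkal",
}

# One case-insensitive pattern per concept: the stem at a word start, any suffix.
PATTERNS = {
    name: re.compile(r"\b" + stem + r"\w*", re.IGNORECASE)
    for name, stem in CONCEPT_STEMS.items()
}


def concept_indicators(note_text):
    """Map each concept to 1 if its stem appears in the note, else 0."""
    return {name: int(bool(p.search(note_text))) for name, p in PATTERNS.items()}


note = "Pt nauseated overnight, vomited x2. K 4.1."
flags = concept_indicators(note)
```

Each note then contributes one 0/1 value per concept, which is the binary representation used for the note-derived variables.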

The tests and concepts were chosen as part of
our previous publication. The laboratory tests
were chosen because they were common. The concepts
were chosen such that they were among the
250 most common diseases, symptoms, procedures, and
medications in the signout notes and such that we
expected either an association between the concept and
the laboratory tests (eg, hyperkalemia) or no strong
association (eg, atelectasis).

We selected five healthcare process events that are expected to
be highly correlated with healthcare process effects: admission
to the hospital, discharge from the hospital, emergency department
visit, ambulatory visit, and ambulatory surgical procedure.
For each variable-healthcare event pair (84 × 5 = 420 pairs),
we calculated lagged linear correlation identically
to our previous description. As before, we used linear
interpolation to generate values between recorded time points
so that we could get a correlation for two variables that do not
have observations on the same days. We normalized the EHR
variables within each patient by taking each patient's values for
that variable, subtracting the mean and dividing by the SD.
Patients with fewer than three values were dropped. We did not
normalize the healthcare events (because it would be redundant).
The effect is to remove inter-patient effects, leaving only
the intra-patient effects so that each patient effectively acts as
his or her own control.
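The per-patient normalization and the interpolation step might look like the sketch below. Several details are assumptions, not taken from the study: the population SD is used, patients whose values do not vary are dropped to avoid division by zero, days are sorted distinct integers, and the function names are hypothetical. The order in which the two steps were applied in the study is not stated here.

```python
from statistics import mean, pstdev


def normalize_patient(values, min_count=3):
    """Z-score one patient's values for one variable.

    Returns None when the patient has fewer than min_count values
    or when the values do not vary (assumed handling).
    """
    if len(values) < min_count:
        return None
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return None
    return [(v - mu) / sd for v in values]


def interpolate_daily(days, values):
    """Linearly interpolate observations onto every integer day in range.

    days must be sorted, distinct integers aligned with values.
    """
    out = {}
    points = list(zip(days, values))
    for (d0, v0), (d1, v1) in zip(points, points[1:]):
        for d in range(d0, d1):
            out[d] = v0 + (v1 - v0) * (d - d0) / (d1 - d0)
    out[days[-1]] = values[-1]
    return out


z = normalize_patient([1.0, 2.0, 3.0])
daily = interpolate_daily([0, 2, 5], [1.0, 3.0, 0.0])
```

After normalization each patient's series has mean 0 and SD 1, so only within-patient changes contribute to the correlations.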

We used lagged linear correlation as quantified by cross-covariance.
Lagged linear correlation is a simple, robust, and
commonly used technique that is highly related to power spectral
analysis and is given by:

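$$\rho_\tau(X,Y) = \frac{E\left[(X_{t+\tau} - \mu_X)(Y_t - \mu_Y)\right]}{\sigma_X \sigma_Y} \qquad (1)$$

$$\rho(X,Y) = \{\rho_{-60}(X,Y), \ldots, \rho_{60}(X,Y)\} \qquad (2)$$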

where X and Y are time series, X_t and Y_t are points at time t, τ
is a lag (here in days), μ_X, μ_Y, σ_X, and σ_Y are the means and
standard deviations of X and Y, respectively, ρ_τ(X,Y) is linear correlation at
lag τ, and ρ(X,Y) is the resulting correlation curve from -60 to
+60 days. This method registers positive for positive correlation
(both values are large and have the same sign) and negative
for negative correlation (both values are large and have
opposite signs). The lag captures how the correlation between
the variables changes when the variables are moved in and out
of synchronization with one another. This is explained in more
detail in Koopmans and Stengel and in the healthcare
context in Hripcsak et al. When choosing this method, the
metric is predefined to be the Euclidean, least squares, or L2
metric. The outcome of this is a curve for each variable X
(eg, glucose) relative to each healthcare context Y (eg, discharge,
outpatient).
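A sketch of the correlation curve for two daily series follows. It is not the authors' released code, and three choices are assumptions: the sum of products is divided by the full series length (the biased estimator, which keeps every value within [-1, 1]), the population SD is used, and a negative lag pairs an earlier value of X with a later value of Y, so that points left of 0 mean the variable changed before the event, as in the figure 1 caption.

```python
from statistics import mean, pstdev


def lagged_correlation(x, y, max_lag=60):
    """Return {lag: correlation} for lag in -max_lag..max_lag.

    x and y are equal-length daily series. At lag tau, x[t + tau] is
    paired with y[t], so a negative lag pairs earlier x with later y.
    """
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    mu_x, mu_y = mean(x), mean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("series must vary")
    n = len(x)
    curve = {}
    for tau in range(-max_lag, max_lag + 1):
        # Sum products over the days where both series are defined.
        total = sum(
            (x[t + tau] - mu_x) * (y[t] - mu_y)
            for t in range(n)
            if 0 <= t + tau < n
        )
        # Divide by n (not the overlap count): assumed biased estimator.
        curve[tau] = total / n / (sd_x * sd_y)
    return curve


# Toy series: x spikes on day 2 and y spikes on day 4, so x leads y by 2 days.
x = [0, 0, 1, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 1, 0, 0, 0]
curve = lagged_correlation(x, y, max_lag=3)
best_lag = max(curve, key=curve.get)
```

In the toy data the curve peaks at a lag of -2 days, reflecting that the change in x preceded the change in y.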

We then clustered the lagged linear correlation curves. While
we believe that there is likely to be a great deal of information
lurking in these curves, as a first step we clustered by similarity
of curves. That is, we wanted curves that looked the same to be
grouped together. Again, there are many methods (eg, k-means,
spectral clustering, hierarchical clustering, standard classification,
etc.) for decomposing this space, and each method is
dependent both on a metric for specifying similarity and a characteristic
or set of characteristics to cluster by. Because we
want to understand how the different curves are related as a
function of the distance between the lagged linear correlation
curves, we imposed a similarity-dependent hierarchical structure
that is agglomerative, meaning that we began with each observation
(correlation curve) and merged the observations based on
the distances between the curves.

To achieve this, we used an agglomerative hierarchical clustering
scheme, namely single linkage agglomerative clustering,
executed in three steps. First, we calculated the 'dissimilarity' or
distance between lagged linear correlation curves; we chose to
specify distance to be the pairwise Euclidean distance between
two lagged linear correlation curves, ρ(X,Y) and ρ(X',Y'):

$$d(\rho(X,Y), \rho(X',Y')) = \left( \sum_{\tau=-60}^{60} \left( \rho_\tau(X,Y) - \rho_\tau(X',Y') \right)^2 \right)^{1/2} \qquad (3)$$

where d is the dissimilarity between the curves. We quantified
dissimilarity between clusters as the minimum Euclidean distance
between member curves, resulting in single linkage.
Given two clusters, C_i and C_j, the single-link distance, d_SL,
between clusters C_i and C_j is given by:

$$d_{SL}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} \{ d(p, q) \} \qquad (4)$$

where p and q are correlation curves ρ(X,Y) for some X and Y.
Second, we clustered the curves. If we have N observations
(lagged linear correlation curves) then we have N-1 steps, at each of
which we merge the two most similar (least dissimilar) clusters.
That is, at the kth step we agglomerate
the two clusters, C_i and C_j of the remaining N-k clusters,
for which d_SL is minimized. Third, we visualized the binary
cluster tree using a dendrogram, in which the position of the link
joining two groups is given by d_SL. We also repeated the analysis
using average linkage instead of single linkage to see if
the clustering was sensitive to the linkage method.
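The first two steps can be sketched in plain Python as below. This is a deliberately simple illustration, not the authors' MATLAB implementation: it recomputes distances at every step instead of caching them, the function names are hypothetical, and the four three-point "curves" are invented values standing in for 121-point correlation curves.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length correlation curves."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_link(ci, cj, curves):
    """Minimum pairwise distance between members of two clusters."""
    return min(euclidean(curves[a], curves[b]) for a in ci for b in cj)


def agglomerate(curves):
    """Single-linkage clustering of a {name: curve} mapping.

    Returns the N-1 merges in order as (members_a, members_b, distance).
    """
    clusters = [(name,) for name in curves]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j], curves)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented toy curves (three lags each), for illustration only.
toy = {
    "hemoglobin": [0.10, 0.20, 0.10],
    "hematocrit": [0.10, 0.21, 0.10],
    "creatinine": [0.00, -0.10, 0.30],
    "urea nitrogen": [0.00, -0.12, 0.28],
}
merges = agglomerate(toy)
```

The recorded merge distances are what a dendrogram would draw as link positions. Swapping `min` for a mean over member pairs in `single_link` would give the average-linkage variant used as the sensitivity check.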



RESULTS 

We first show a sample lagged linear correlation curve. Figure 1
shows the curve for intravascular creatinine with respect to
inpatient admission.

To illustrate the clusters better, we first plotted the laboratory
tests alone. Figure 2 shows the clusters of laboratory values
based on their lagged linear correlation with inpatient admission.
We see mostly logical groupings of variables, with groups
of coagulation studies (partial thromboplastin time (PTT),
prothrombin time (PT), and international normalized ratio (INR)),
hematological studies (red blood cell count, hemoglobin,
hematocrit), renal studies (urea nitrogen, creatinine), and liver and
gastrointestinal studies (amylase, lipase, bilirubin, alanine
aminotransferase, aspartate aminotransferase). In some cases the tests
are naturally ordered together and track each other (hemoglobin
and calculated hematocrit being an extreme example). While in
many cases the tests are performed together in a battery (eg, red
blood cell count, hemoglobin, hematocrit), in other cases they
are not necessarily ordered together (eg, PTT, PT, INR). Similar
clusters are found for inpatient discharge events (see supplement,
available online only).

Figure 3 shows the clusters for ambulatory surgery events.
Note the change in the clusters, with PTT and INR still close
but with PT in the distance. One might expect INR and PT to
remain close because they are the same test other than the fact
that the former is normalized, whereas PTT measures a different
coagulation pathway. The clusters may therefore have more to
do with physician ordering patterns (both what is ordered
together and how the ordering of tests evolves over time) than
with actual values. Outpatient visits and emergency department
visits (see supplement, available online only) led to similar clusters
to ambulatory surgery, perhaps implying that the major division
is between inpatient events (admission and discharge) and
outpatient events (ambulatory surgery, outpatient visits, emergency
department).

[Figure 1 plot; horizontal axis: days (creatinine with respect to inpatient admission)]

Figure 1 Lagged linear correlation curve for intravascular creatinine versus inpatient admission. Left of 0 days implies that a change in creatinine 
preceded the admission. Points above 0 are positively correlated. The curve indicates that patients tend to have higher creatinine leading up to the 
admission (perhaps as part of their disease state), that their creatinine peaks around admission (perhaps with the acute illness), and falls after 
admission (perhaps due to treatment and recovery). 



We then plotted all 84 variables. Figure 4 shows the clusters
for all note concepts and laboratory values based on inpatient
discharge events. Laboratory values tend to cluster together and
note concepts tend to cluster together. There are some exceptions,
with PT and INR both clustering away from the other
laboratory variables. Within data type, clinical clustering begins
to appear. For example, ulcer, emesis, diarrhea, cirrhosis,
pancreatitis, nausea, and vomiting appear near each other, as do
hypotension, lasix, atrial fibrillation (AFIB), and digoxin.
Figure 5 shows the clusters for emergency department events.
Outpatient visit events are similar (see supplement, available
online only). Clustering using average linkage instead of single
linkage led to essentially identical clusters, with figures 2 to 5
looking almost identical and changing none of the highlighted
examples.

We then show the lagged linear correlation profiles for
representative pairs of variables. Figure 6 shows PTT and INR,
a non-trivial pair (the tests are not identical and not
always ordered together) that nevertheless clusters together in all five
healthcare event contexts. Figure 7 shows the nausea and vomiting
note concepts, which should be medically related. Figure 8
shows a limitation of our clustering technique. The hemoglobin
laboratory test and the anemia note concept are medically
related but are not clustered together. Manual review of the
graphs shows that they in fact do match to some degree but
their signs are reversed: a drop in hemoglobin corresponds to
an increase in anemia.
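The sign-reversal limitation follows directly from the Euclidean distance, as the short sketch below illustrates. The two curves are invented (one is the exact mirror image of the other), and the sign-flipped comparison is shown only as a diagnostic illustration; it is not part of the study's method.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length curves."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Invented curves: the second is the mirror image of the first.
hemoglobin_like = [0.05, 0.20, 0.10, -0.05]
anemia_like = [-v for v in hemoglobin_like]

# Distance as the clustering sees it: mirrored curves are far apart.
d_raw = euclidean(hemoglobin_like, anemia_like)

# Distance after flipping one curve's sign: the shapes coincide.
d_flipped = euclidean(hemoglobin_like, [-v for v in anemia_like])

# Length of the original curve, for comparison with d_raw.
norm = math.sqrt(sum(v * v for v in hemoglobin_like))
```

Here `d_raw` equals twice the length of the curve while `d_flipped` is zero, so two perfectly anti-correlated profiles land as far apart as their magnitude allows even though their shapes agree.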

DISCUSSION 

The approach of clustering EHR variables based on their associations
with important healthcare process events appears to
group variables into sensible clusters; this lends face validity to
the approach. For example, related laboratory tests clustered
together and related note concepts clustered together. Of
course, if our goal were to find associations, then we could
simply cluster the variables directly according to their pairwise
associations, and we have carried out that study for note-laboratory
pairs. Our goal instead is to learn how the healthcare
process affects the variables, and our current approach does
seem to pull in the healthcare context. The fact that the clusters
differ between figures 2 and 3 demonstrates that healthcare context
does affect the associations, and it appears to be sensitive to the
manner in which the data were collected.

The distinction between pairwise associations and healthcare
associations is important to emphasize. We did not, for
example, attempt to study the pairwise relationship between
inpatient concepts (eg, extubate and intubate) in the outpatient
setting. Instead, we compared how extubation relates to the outpatient
setting, and how intubation relates to the outpatient
setting, and then how those two relationships compared. For
example, it may be that they cluster together because they are
similarly distant from the outpatient setting. Although it is less
common, occasionally a patient will be admitted and intubated
soon after an outpatient visit or be extubated and discharged
with a follow-up visit soon afterwards, and this will also be
reflected in the correlations.

We believe that we may be able to exploit these groupings for
the phenotyping process. One of the challenges in creating phenotypes
is accounting for the biases of data collection. Grouping
variables based on their associations with healthcare process
events may quickly, and on a large scale, alert the phenotype
knowledge engineer to variables with similar biases. In this
study, for example, laboratory values and note concepts were
separated. While that division may be obvious, on a larger scale
it may be possible to group variables based on less obvious but
equally important divisions in the biases that they are likely to
have. In effect, this study serves as the measurement study that
demonstrates the approach's ability to group and separate variables
according to their measurement properties.

The groupings might then be used in the phenotyping
process. The knowledge engineer might purposely select input
variables from a broad variety of bias types in an attempt to

Figure 2 Clustering of laboratory values based on their lagged linear
correlations with inpatient admission events. The x-axis shows the
unitless single-link distance, with length of the horizontal line in the
dendrogram representing the distance between the connected clusters.
We see mostly logical groupings of variables, with groups of 
coagulation studies (INR, international normalized ratio; PT, 
prothrombin time; PTT, partial thromboplastin time), hematological 
studies (RBC, red blood cell count; hemoglobin, hematocrit), renal 
studies (urea nitrogen, creatinine), and liver and gastrointestinal studies 
(ALT, alanine aminotransferase; AST, aspartate aminotransferase; 
amylase, lipase, bilirubin). CK, creatine kinase; WBC, white blood cell. 



reduce the variance of the phenotype. This will require further
study and proof, but the intuition is that averaging several variables
with different measurement properties but similar underlying
physiology will tend to improve the signal-to-noise ratio of
the physiological signal being measured to the noise of measurement
bias. For example, based on our results, if one is creating
an anemia phenotype, it may be beneficial to include both a
threshold on hemoglobin and the note concept anemia because
the two appear to act somewhat independently despite their
obvious clinical relation.
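The intuition can be made concrete under a strong simplifying assumption that this study does not establish: suppose each of n variables, once put on a common scale and sign, equals a shared physiological signal S plus its own measurement error B_i, with the errors independent, zero mean, and of variance \sigma_i^2. The symbols V_i, S, B_i, and n are introduced here for illustration only.

```latex
V_i = S + B_i, \qquad i = 1, \ldots, n

\bar{V} = \frac{1}{n} \sum_{i=1}^{n} V_i = S + \frac{1}{n} \sum_{i=1}^{n} B_i

\operatorname{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} B_i \right)
  = \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2
  \;\; \left( = \frac{\sigma^2}{n} \text{ when all } \sigma_i^2 = \sigma^2 \right)
```

Under these assumptions the signal is untouched by averaging while the error variance shrinks roughly in proportion to 1/n. Variables that share the same bias would violate the independence assumption and gain far less, which is why drawing inputs from different bias groups is the step that matters.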

Figure 3 Clustering of laboratory values based on their lagged linear
correlations with ambulatory surgery events. The x-axis shows the
unitless single-link distance, with length of the horizontal line in the
dendrogram representing the distance between the connected clusters.
Compared to figure 2, note the change in the clusters, with partial 
thromboplastin time (PTT), and international normalized ratio (INR) still 
close but with prothrombin time (PT) in the distance. ALT, alanine 
aminotransferase; AST, aspartate aminotransferase; CK, creatine kinase; 
RBC, red blood cell; WBC, white blood cell. 



Our current approach is limited in the number of healthcare
process events that were used and the number of EHR variables
that were studied. We believe that an effective approach will
require a larger number of disparate healthcare process events
and should be applied to a large cohort of EHR variables.
Another limitation is that many of our concepts and our corpus
are primarily from the inpatient setting (although resident signout notes
are in fact occasionally used in the resident clinic setting).
Nevertheless, our correlations with outpatient healthcare events
still produced reasonable clusters. Our work was carried out at
one academic medical center, and we do not yet know which
healthcare processes will be generalizable. We believe that the

Figure 4 Clustering of concepts and laboratory values based on their lagged linear correlations with inpatient discharge events. The x-axis shows
the unitless single-link distance, with length of the horizontal line in the dendrogram representing the distance between the connected clusters.
Laboratory values tend to cluster together (bracket) and note concepts tend to cluster together, although prothrombin time (PT), international 
normalized ratio (INR), and chloride (at arrows) cluster away from the other laboratory variables. Within data type, ulcer, emesis, diarrhea, cirrhosis, 
pancreatitis, nausea, and vomiting appear near each other, as do hypotension, lasix, atrial fibrillation (AFIB), and digoxin. afib, atrial fibrillation; alt, 
alanine aminotransferase; ast, aspartate aminotransferase; ck, creatine kinase; cmv, cytomegalovirus; copd, chronic obstructive pulmonary disease; 
cri, chronic renal insufficiency; cva, cerebrovascular accident; hctz, hydrochlorothiazide; inr, international normalized ratio; mrsa, methicillin-resistant 
Staphylococcus aureus; pt, prothrombin time; ptt, partial thromboplastin time; rbc, red blood cells; tb, tuberculosis; uti, urinary tract infection; vtach, 
ventricular tachycardia; wbc, white blood cells. 

Figure 5 Clustering of concepts and laboratory values based on their lagged linear correlations with emergency department events. The x-axis
shows the unitless single-link distance, with length of the horizontal line in the dendrogram representing the distance between the connected
clusters. Compared to figure 4, the laboratory results are further dispersed (brackets and arrows). afib, atrial fibrillation; alt, alanine
aminotransferase; ast, aspartate aminotransferase; ck, creatine kinase; cmv, cytomegalovirus; copd, chronic obstructive pulmonary disease; cri, 
chronic renal insufficiency; cva, cerebrovascular accident; hctz, hydrochlorothiazide; inr, international normalized ratio; mrsa, methicillin-resistant 
Staphylococcus aureus; pt, prothrombin time; ptt, partial thromboplastin time; rbc, red blood cells; tb, tuberculosis; uti, urinary tract infection; vtach, 
ventricular tachycardia; wbc, white blood cells. 

Figure 6 Comparison of profiles of lagged linear correlation with five
healthcare (HC) events for partial thromboplastin time (PTT) and 
international normalized ratio (INR). Each graph shows the lagged 
linear correlation curve for a concept-healthcare event pair. In each 
graph, the horizontal axis covers -60 to +60 days with 0 days at the 
midpoint, and the horizontal line represents a correlation of zero with 
positive correlation above the line. The profiles are well matched for 
INR and PTT. ED, emergency department. 

high-level concepts, such as the overnight measurement of
patients who are more ill, will be broadly applicable to other
centers. We have made our code available via GitHub (github.org)
and MATLAB Central (http://www.mathworks.com/matlabcentral),
and verification via national efforts like the
eMERGE network or OMOP would be beneficial.

Figure 7 Comparison of profiles of lagged linear correlation with five
healthcare (HC) events for nausea and vomiting. Each graph shows the 
lagged linear correlation curve for a concept-healthcare event pair. In 
each graph, the horizontal axis covers -60 to +60 days with 0 days at 
the midpoint, and the horizontal line represents a correlation of zero 
with positive correlation above the line. The profiles are well matched 
for nausea and vomiting. ED, emergency department. 

Figure 8 Comparison of profiles of lagged linear correlation with five
healthcare (HC) events for hemoglobin laboratory test and the anemia 
note concept (ANEM). Each graph shows the lagged linear correlation 
curve for a concept-healthcare event pair. In each graph, the horizontal 
axis covers -60 to +60 days with 0 days at the midpoint, and the 
horizontal line represents a correlation of zero with positive correlation 
above the line. These profiles illustrate a limitation of our technique. 
The curves cluster away from each other but manual review reveals that 
they are indeed somewhat similar except for a reversal of sign. Low 
hemoglobin corresponds to more anemia. ED, emergency department. 

In summary, correlating EHR variables with healthcare
process events produced sensible grouping of variables, but
appeared to be highly sensitive to the manner in which the variables
were collected. We believe that it may be possible to
exploit this sensitivity to improve the phenotyping process, and
that the approach may point the way in the longer run to a
more automated and reliable phenotyping process.